THE USE OF THE SEQUENTIAL PROBABILITY RATIO TEST
IN MAKING GRADE CLASSIFICATIONS IN CONJUNCTION
WITH TAILORED TESTING



In many testing applications, the major use of the obtained score is to
classify a person as being above or below some criterion score. Examples of 
such uses of test results include the screening of job applicants and the 
classification of students as masters and non-masters when using the mastery 
learning paradigm (Bloom, 1971). For such applications it is not necessarily 
required that the person's ability be accurately estimated, but only that the 
measurements be sufficiently precise that the examinees can be accurately 
classified.

When making such classifications, the accuracy of measurement required
in making the decision is dependent upon how far from the cutting score the 
person is located. If the examinee is far above or below the cutting score, 
minimal accuracy will be required. If the examinee is close to the cutting 
score, high precision will be required. Since the accuracy of an ability 
estimate is dependent to a large extent on test length, it follows that shorter 
tests can be used if a person's ability were a substantial distance from the 
cutting score. Depending on the number of individuals who are far from the 
cutting score, the average length of test needed for classification might be 
substantially reduced over what is commonly used. 

Based on this analysis, an optimal procedure for testing examinees for
classification purposes would be to check the accuracy of classification af- 
ter each item is administered. If the accuracy were sufficiently high, test- 
ing could stop. If the accuracy were not high enough, another item would be 
administered. 

Exactly this type of procedure was developed by Wald (1947) to assist in
quality control work during World War II. His procedure was designed to
determine whether a batch of parts was acceptable based on whether it contained
a sufficiently low number of defectives. The basic concept behind the
procedure is to take an observation from the batch and determine the probability
of the observation under the hypothesis of an acceptable or unacceptable batch.
A ratio is formed by dividing the probability of the observation coming from
an acceptable batch by the probability of it coming from an unacceptable batch.
If the ratio is sufficiently large, the batch is considered acceptable and if
it is sufficiently small, the batch is considered unacceptable. If the ratio
is near 1.0, another observation is randomly selected. A new ratio is then
formed using all of the previous observations. The process continues until a
decision is reached. Because of the sequential nature of the process, it has
been labeled the Sequential Probability Ratio Test (SPRT).

Since its development, the SPRT has been widely used for quality control
work (Govindarajulu, 1975). However, only recently has it appeared in the 
mental testing literature. Ferguson (1970) used the SPRT procedure to deter- 
mine whether 75 students had mastered material in a hierarchically arranged 
set of instructional units. His procedure randomly generated items by computer 
using item forms and then administered the items using a computer terminal. 
He found a substantial reduction in testing time and in the number of items 
required to make a decision. The procedure was found to be in 99% agreement
with the longer tests traditionally used to make the decisions.



No other studies were found that actually made real time decisions using 
the SPRT procedure. However, Epstein & Knerr (1978) did present the results 
of a real data simulation using Army proficiency testing response data. They 
found that only 33% as many items were needed for the SPRT based procedure
without loss in decision accuracy. Sixtl (1974), Kalish (1980), and Kingsbury 
and Weiss (1980) present the results of simulation studies showing that the 
SPRT procedures result in a substantial reduction in the number of items re- 
quired to make decisions. Thus, all the research to date supports the conten- 
tion that SPRT based procedures lead to increased testing efficiency. 

Despite the promising results reported in the studies listed above, none 
of the procedures described take full advantage of the quality items in the 
item pool. That is, by randomly selecting items, the best items for making 
the classification decision may not be administered. A better procedure would 
be to select the items from the item pool that would be most informative for
making the decision using a tailored testing paradigm. Reckase (1978) has 
shown that such a procedure could be used with the SPRT as long as local in- 
dependence could be assumed. In a series of simulation studies (Reckase, 1980a, 
1980b), he demonstrated that SPRT procedures will work with tailored testing. 
Further, a three-parameter logistic based procedure was found to give better 
results than a one-parameter logistic based procedure. 

With the positive results obtained at this time it seems prudent to eval- 
uate the quality of SPRT/tailored testing procedures for actual decisions. The 
purpose of this report is to present some results of the operation of the SPRT/
tailored testing hybrid in the context of grade classification. Further, one- 
parameter and three-parameter logistic model based procedures will be compared 
on the basis of decision consistency. The overall criterion for success will 
be a comparison with traditional grading procedures.

The SPRT Procedure 

The SPRT procedure has been described in detail elsewhere (Wald, 1947; 
Epstein & Knerr, 1978; Reckase, 1980a) so only a brief description will be given
here. The basic equations will be presented along with the procedures for de- 
scribing the characteristics of the decision making process. 

As described above, the basic philosophy behind the SPRT procedure is to
determine the probability of the observed responses for two alternative
hypotheses and then form the ratio of the probabilities. A large ratio favors one
of the hypotheses and a small ratio favors the other. For example, if $H_1$ is
the hypothesis that the ability ($\theta$) for a person is equal to $\theta_1$, and $H_2$ is the
hypothesis that the ability equals $\theta_2$, the probability of the obtained responses,
$x_1, x_2, \ldots, x_n$, given these hypotheses would be:

$$P(x_1, x_2, \ldots, x_n \mid \theta_1) = \prod_{i=1}^{n} P(x_i \mid \theta_1) \qquad (1)$$

$$P(x_1, x_2, \ldots, x_n \mid \theta_2) = \prod_{i=1}^{n} P(x_i \mid \theta_2) \qquad (2)$$

under the local independence assumption of latent trait theory. The values
of $P(x_i \mid \theta_j)$ would be computed using the appropriate latent trait model assuming
known item parameters from a previous item calibration. Assuming $\theta_1 > \theta_2$, the
probability ratio would then be formed as

$$\frac{P(x_1, x_2, \ldots, x_n \mid \theta_1)}{P(x_1, x_2, \ldots, x_n \mid \theta_2)} \qquad (3)$$

If this ratio were sufficiently large $H_2$ would be rejected, and if the ratio
were sufficiently small $H_1$ would be rejected. The determination of what
constitutes large and small depends upon the error rates that are considered
acceptable.
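As a minimal sketch of Equations 1 through 3, the following Python fragment computes the probability ratio for a response string under a three-parameter logistic model. The function names, the scaling constant D = 1.7, and the item parameters are illustrative assumptions and are not taken from the study.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)

def p_correct(theta, a, b, c):
    """3PL probability of a correct response (a=1, c=0 gives a 1PL-type curve)."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def likelihood(responses, items, theta):
    """Product over items of P(x_i | theta), assuming local independence."""
    value = 1.0
    for x, (a, b, c) in zip(responses, items):
        p = p_correct(theta, a, b, c)
        value *= p if x == 1 else 1.0 - p
    return value

def probability_ratio(responses, items, theta_1, theta_2):
    """Equation 3: likelihood at theta_1 divided by likelihood at theta_2."""
    return likelihood(responses, items, theta_1) / likelihood(responses, items, theta_2)

# Hypothetical item parameters (a, b, c) and response strings (1 = correct).
items = [(1.0, -0.5, 0.2), (1.2, 0.0, 0.2), (0.8, 0.5, 0.2)]
ratio_all_right = probability_ratio([1, 1, 1], items, theta_1=1.0, theta_2=0.0)
ratio_all_wrong = probability_ratio([0, 0, 0], items, theta_1=1.0, theta_2=0.0)
```

With theta_1 above theta_2, correct answers push the ratio above 1.0 and incorrect answers push it below 1.0. For long tests a sum of log probabilities would be numerically safer than the raw product.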



Suppose $\alpha$ is the probability of accepting $H_1$ when $H_2$ is really true and $\beta$
is the probability of accepting $H_2$ when $H_1$ is really true. Wald (1947) has
shown that a good approximation to the decision points needed for the
probability ratio (Equation 3) can be obtained by the following two expressions:

$$\text{Upper decision point} = A = \frac{1 - \beta}{\alpha} \qquad (4)$$

and

$$\text{Lower decision point} = B = \frac{\beta}{1 - \alpha} \qquad (5)$$

Thus, if Equation 3 gives a result larger than $A$, $H_1$ should be accepted with
an error rate of approximately $\alpha$, and if the expression yields a value less than
$B$, $H_2$ should be accepted with an error rate of approximately $\beta$.
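The decision rule can be sketched in a few lines of Python, using Wald's approximations A = (1 - beta) / alpha and B = beta / (1 - alpha). The function names and the error rates of .05 are illustrative assumptions.

```python
def decision_points(alpha, beta):
    """Wald's approximate bounds: A = (1 - beta) / alpha, B = beta / (1 - alpha)."""
    upper = (1.0 - beta) / alpha
    lower = beta / (1.0 - alpha)
    return upper, lower

def sprt_decision(ratio, alpha, beta):
    """Compare a probability ratio with the two decision points."""
    upper, lower = decision_points(alpha, beta)
    if ratio > upper:
        return "accept H1"
    if ratio < lower:
        return "accept H2"
    return "continue"

# Illustrative error rates; A is 19 and B is about .053 here.
A, B = decision_points(alpha=0.05, beta=0.05)
```

Lowering either error rate moves the decision points farther from 1.0, so more observations are typically needed before the ratio leaves the continuation band.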

The procedure described above assumes that a decision is to be made between
two simple hypotheses: $H_1: \theta = \theta_1$ or $H_2: \theta = \theta_2$. Wald (1947) has generalized this
procedure to making decisions concerning complex hypotheses such as $H_2: \theta < \theta_c$ and
$H_1: \theta > \theta_c$. This is a much more useful set of hypotheses because it matches the
decision process used in making classifications above or below a criterion score.

In order to test a complex hypothesis using the SPRT, an indifference region
must first be specified around the cutting score, $\theta_c$, for the decision. The
indifference region is the area around the cutting score in which either
classification is considered equally good. For example, if $\theta_c$ is the cutting score for
making the decision, persons sufficiently close to $\theta_c$ could be classified either
high or low without appreciable loss. Sufficiently close is defined here as
being between $\theta_1$ and $\theta_2$ when $\theta_1 > \theta_c > \theta_2$. If a person were outside the region from
$\theta_1$ to $\theta_2$ and were misclassified, the error would be considered serious.

The use of the SPRT to test complex hypotheses works the same as for the
simple hypotheses except that the limits of the indifference region are used in
Equation 3 to form the probability ratio instead of the hypothesized true values.
The upper and lower decision points for the test are determined in exactly the
same way as before (Equations 4 and 5). However, now the operation of the SPRT
is controlled not only by the $\alpha$ and $\beta$ error rates, but also by the width of the
indifference region. The higher the error rates and the wider the indifference
region, the fewer the items that need to be administered.

The quality of operation of the SPRT procedure is usually judged on the
basis of two mathematical functions called the operating characteristic (OC) 
function and the average sample number (ASN) function. The OC function is 
defined as 

$$OC(\theta) = P(\text{classified below } \theta_c \mid \theta).$$

This function should have values close to 1.0 for $\theta < \theta_c$ and values close to 0.0
for $\theta > \theta_c$. To the extent that this function drops quickly from a value near 1.0
to near 0.0 in the indifference region, the SPRT procedure is working well.

The ASN function is defined as the average number of observations needed
to make a decision as a function of $\theta$. This function is typically peaked, with
high values near the cutting score and decreasing values with increased distance
from the cutting score. Both the OC function and the ASN function are dependent
on the size of the error rates and the width of the indifference region. A
narrow indifference region and/or low error rates result in a steep OC function
and require a large number of observations for decisions. High error rates and/
or a wide indifference region flatten the OC function and reduce the number of
observations required. Thus, the price paid for high precision is a greater
number of observations. More detailed information concerning the OC and ASN
functions can be found in Wald (1947), Reckase (1980a), or Epstein and Knerr
(1978).
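The behavior of the OC and ASN functions can be illustrated with a small Monte Carlo sketch. Everything specific here is an assumption made for illustration: a one-parameter logistic model without a scaling constant, items that all have difficulty equal to the cutting score of 0.0, an indifference region from -0.5 to 0.5, error rates of .05, and truncation at 200 items. It is not the simulation design of the studies cited above.

```python
import math
import random

def p_1pl(theta, b):
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def simulate_sprt(theta, theta_1, theta_2, alpha, beta, b, max_items, rng):
    """Run one truncated SPRT; return (classified_below, items_used)."""
    log_upper = math.log((1.0 - beta) / alpha)
    log_lower = math.log(beta / (1.0 - alpha))
    p_hi, p_lo = p_1pl(theta_1, b), p_1pl(theta_2, b)
    log_ratio = 0.0
    for n in range(1, max_items + 1):
        correct = rng.random() < p_1pl(theta, b)
        if correct:
            log_ratio += math.log(p_hi / p_lo)
        else:
            log_ratio += math.log((1.0 - p_hi) / (1.0 - p_lo))
        if log_ratio > log_upper:
            return False, n   # classified above the cutting score
        if log_ratio < log_lower:
            return True, n    # classified below the cutting score
    return log_ratio < 0.0, max_items

def oc_and_asn(theta, reps=2000, seed=0):
    """Estimate OC(theta) and ASN(theta) by repeated simulation."""
    rng = random.Random(seed)
    runs = [simulate_sprt(theta, 0.5, -0.5, 0.05, 0.05, 0.0, 200, rng)
            for _ in range(reps)]
    oc = sum(1 for below, _ in runs if below) / reps
    asn = sum(n for _, n in runs) / reps
    return oc, asn
```

Calling `oc_and_asn` at abilities well below, at, and well above the cutting score should show the pattern described in the text: OC near 1.0, then intermediate, then near 0.0, with the ASN largest at the cutting score.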

Tailored Testing Procedure

Tailored testing procedures are defined by their methods of item selection 
and ability estimation. The procedure used in this study selects items to maxi- 
mize the value of the information function (Birnbaum, 1968) at the previous 
ability estimate. Ability was estimated using an empirical maximum likelihood 
approach. The procedure is described in detail by McKinley & Reckase (1980), so 
it will not be described again here. The above tailored testing procedure was 
used with both the one-parameter logistic (1PL) and the three-parameter logistic
(3PL) models in the study reported here.
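Maximum information item selection can be sketched as follows, using the standard item information function for the three-parameter logistic model. The pool, the helper names, and the scaling constant D = 1.7 are assumptions for illustration; the study's own implementation is the one described by McKinley & Reckase (1980).

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)

def p_3pl(theta, a, b, c):
    """3PL probability of a correct response."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def information(theta, a, b, c):
    """3PL item information: D^2 a^2 (Q/P) ((P - c) / (1 - c))^2."""
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def select_item(theta_hat, pool, used):
    """Index of the unused item with the most information at theta_hat."""
    candidates = [i for i in range(len(pool)) if i not in used]
    return max(candidates, key=lambda i: information(theta_hat, *pool[i]))

# Hypothetical pool of (a, b, c) triples; with a fixed and c = 0 the choice
# reduces to the item whose difficulty is nearest the ability estimate.
pool = [(1.0, -2.0, 0.0), (1.0, 0.0, 0.0), (1.0, 2.0, 0.0)]
first_choice = select_item(0.1, pool, used=set())
second_choice = select_item(0.1, pool, used={first_choice})
```

When discrimination and guessing parameters vary across the pool, the same function favors highly discriminating items with low guessing parameters, not only items of matching difficulty.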

Tailored Testing/SPRT Hybrid

The procedure used to administer the test items in this study used compo- 
nents of both tailored testing methodology and the SPRT. Items to be adminis- 
tered in the process of the computerized test were selected using the maximum 
information criterion (Birnbaum, 1968; McKinley & Reckase, 1980). After the 
response to each item was obtained, the value of the probability ratio (Equation 
3) was computed and a decision was made to classify high, classify low, or to 
administer another item. If another item were to be administered, a maximum
likelihood ability estimate was obtained and a new item was selected to maximize 
the information function at that ability estimate and administered to the exami- 
nee. The process continued until a classification decision had been made or 
until 20 items had been administered. After 20 items, ratios above 1.0 resulted
in a high classification, and ratios below 1.0 resulted in low classification. 
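The administration loop for a single cutting score can be sketched as below. This is a simplified illustration under stated assumptions: a crude grid search stands in for the study's maximum likelihood estimator, the pool and the `answer` callback are hypothetical, D = 1.7 is assumed, and a ratio of exactly 1.0 at truncation is arbitrarily treated as a low classification.

```python
import math

D = 1.7  # conventional logistic scaling constant (an assumption here)

def p_3pl(theta, a, b, c):
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))

def information(theta, a, b, c):
    p = p_3pl(theta, a, b, c)
    return (D * a) ** 2 * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def log_likelihood(theta, administered):
    """Log of the product in Equations 1 and 2 for the items given so far."""
    total = 0.0
    for (a, b, c), x in administered:
        p = p_3pl(theta, a, b, c)
        total += math.log(p if x == 1 else 1.0 - p)
    return total

def grid_mle(administered):
    """Crude ML estimate on a grid from -4 to 4 (stand-in estimator)."""
    grid = [g / 10.0 for g in range(-40, 41)]
    return max(grid, key=lambda t: log_likelihood(t, administered))

def hybrid_test(pool, answer, theta_1, theta_2, alpha, beta,
                max_items=20, start=0.0):
    """Return (classification, items_used) for one cutting score."""
    log_upper = math.log((1.0 - beta) / alpha)
    log_lower = math.log(beta / (1.0 - alpha))
    administered, used, theta_hat, log_ratio = [], set(), start, 0.0
    while len(administered) < max_items and len(used) < len(pool):
        # Select the unused item with maximum information at the estimate.
        index = max((i for i in range(len(pool)) if i not in used),
                    key=lambda i: information(theta_hat, *pool[i]))
        used.add(index)
        administered.append((pool[index], answer(index)))
        # Log of Equation 3 with the indifference-region limits.
        log_ratio = (log_likelihood(theta_1, administered)
                     - log_likelihood(theta_2, administered))
        if log_ratio > log_upper:
            return "high", len(administered)
        if log_ratio < log_lower:
            return "low", len(administered)
        theta_hat = grid_mle(administered)
    # Truncation rule: a ratio above 1.0 (log ratio above 0) classifies high.
    return ("high" if log_ratio > 0.0 else "low"), len(administered)

# Hypothetical 30-item pool and an examinee who answers every item correctly.
pool = [(1.0, -1.5 + 0.1 * i, 0.2) for i in range(30)]
always_right = hybrid_test(pool, lambda i: 1, theta_1=0.45, theta_2=-0.45,
                           alpha=0.02, beta=0.10)
```

Working with log ratios avoids underflow in long response strings and leaves the decision rule unchanged, since the logarithm is monotonic.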

Research Design

The purpose of the research reported here was to compare 1PL and 3PL based
procedures for making classification decisions using the SPRT. Since the true
classifications were unknown, a consistency of classification design was used
as a criterion for evaluation. To facilitate the comparison of decision
consistency, a test-retest design was used in which tailored tests based on both
the 1PL and 3PL models were administered to the same individuals in two sessions
one week apart. In the first session the 1PL and 3PL tailored tests were
administered as described above without a break in between. From the student's
point of view, only one test was administered. In the second session, the same
procedure was followed, only the order of presentation of the 1PL and 3PL
procedures was reversed to counterbalance fatigue effects. The initial order of
presentation of the 1PL and 3PL procedures was randomly assigned to the students.

Within the tailored tests, three grade placement decisions were made using
the SPRT procedure. Based on the test information, students were placed above 
or below the A/B grade cutoff, the B/C grade cutoff, and the C/D grade cutoff.
Thus, if a student were classified below the A/B cutoff, and above the B/C cut- 
off, a grade of B would be assigned. The grade cutoffs for the study were set 
to be consistent with those used on the traditional test using the test charac- 
teristic curve. 
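The mapping from the three above/below decisions to a letter grade can be sketched as follows. The function name is hypothetical, and checking the cutoffs from the top down is an assumption: the text does not say how logically inconsistent decision patterns (for example, above A/B but below B/C) were resolved.

```python
def assign_grade(above_ab, above_bc, above_cd):
    """Map three above/below SPRT decisions onto a letter grade."""
    if above_ab:
        return "A"
    if above_bc:
        return "B"
    if above_cd:
        return "C"
    return "D"

# Below the A/B cutoff but above the B/C cutoff gives a grade of B.
example_grade = assign_grade(above_ab=False, above_bc=True, above_cd=True)
```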

Before the cutoffs could be set, the traditional test first had to be linked 
to the tailored testing item pool. This was done so that the cutoffs determined 
from the traditional test would be on the same scale as the tailored test ability 
estimates. The linking was performed using the major axis method for the 1PL 
model, and the maximum likelihood method for the 3PL model. See Reckase (1979a) 
for a more detailed description of these procedures. 

The traditional test used as a basis for the grade cutoffs was a 50 item 
multiple choice test over the area of classroom evaluation procedures. The test 
and the population of students who took part in the study were from an intro- 
ductory course on educational measurement techniques. The grade classification 
regions for the traditional test in terms of raw scores were: 42-50, A; 33-41, B;
29-32, C; and 28 and below, D. Based on these score ranges, the A/B cutoff was
set at 41½, the B/C cutoff at 32½, and the C/D cutoff at 28½. The 1PL ability
scale cutoffs corresponding to the raw score cutoffs were A/B, 2.24; B/C, .95;
and C/D, .46. The cutoffs on the 3PL ability scale were: A/B, .78; B/C, -.85;
and C/D, -1.39. These values were determined by finding the points on the latent
trait scales that were equivalent to the raw score points.
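Finding the ability that corresponds to a raw score cutoff amounts to inverting the test characteristic curve. The sketch below does this by bisection for a one-parameter logistic model. The 50 item difficulties are invented for illustration, so the resulting cutoffs are not the study's values.

```python
import math

def p_1pl(theta, b):
    """One-parameter logistic probability of a correct response."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def expected_raw_score(theta, difficulties):
    """Test characteristic curve: sum of item response probabilities."""
    return sum(p_1pl(theta, b) for b in difficulties)

def theta_for_raw_score(raw, difficulties, low=-6.0, high=6.0, tol=1e-6):
    """Bisection search for the ability whose expected raw score equals raw."""
    while high - low > tol:
        mid = (low + high) / 2.0
        if expected_raw_score(mid, difficulties) < raw:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

# Hypothetical 50-item test with difficulties spread evenly from -2 to 2.
difficulties = [-2.0 + 4.0 * i / 49 for i in range(50)]
cut_ab = theta_for_raw_score(41.5, difficulties)
cut_bc = theta_for_raw_score(32.5, difficulties)
cut_cd = theta_for_raw_score(28.5, difficulties)
```

Bisection works here because the test characteristic curve is strictly increasing in ability, so each raw score between the curve's limits maps to exactly one point on the ability scale.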

Along with the cutting points, an indifference region and the $\alpha$ and $\beta$ error
rates were needed to totally specify the SPRT procedure. A reasonable
indifference region for the test was thought to be one standard error of measurement on
either side of the cutting point. Based on the traditional test reliability of
.60 for the sample of students used in the study, the standard error of
measurement in 1PL and 3PL ability units was .45. Thus, the indifference regions were
set at A/B, 2.69 to 1.79; B/C, 1.40 to .50; and C/D, .91 to .01 for the 1PL
procedure and A/B, .23 to 1.33; B/C, -1.30 to -.40; and C/D, -1.84 to -.94 for the
3PL procedure. The differences in indifference regions for the two procedures
were due to differences in the way the origins of the ability scales were defined.
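The 1PL indifference regions follow directly from the cutoffs and the standard error of measurement of .45, as this short check shows. The variable and function names are illustrative.

```python
SEM = 0.45  # standard error of measurement in ability units, from the text

# 1PL ability-scale cutoffs reported in the text.
cutoffs_1pl = {"A/B": 2.24, "B/C": 0.95, "C/D": 0.46}

def indifference_region(cutoff, half_width=SEM):
    """Return (upper, lower) limits, rounded to two decimals."""
    return round(cutoff + half_width, 2), round(cutoff - half_width, 2)

regions_1pl = {name: indifference_region(c) for name, c in cutoffs_1pl.items()}
```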

Since it was considered a more serious error to classify someone high
incorrectly than low incorrectly, $\alpha$ was set at .02 and $\beta$ was set at .10. Using
Equations 4 and 5, the decision points for the SPRT were computed to be A=45
and B=.102. This resulted in a classification in the higher grade category if
Equation 3 resulted in a value greater than 45, in the lower grade category if
the value was below .102, and continued testing if the result was between 45
and .102. The same A and B values were used for both the 1PL and 3PL procedures.
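As a worked check of these values, substituting the two error rates into Wald's approximations, A = (1 - β)/α and B = β/(1 - α), gives:

```latex
A = \frac{1 - \beta}{\alpha} = \frac{1 - .10}{.02} = 45,
\qquad
B = \frac{\beta}{1 - \alpha} = \frac{.10}{1 - .02} \approx .102
```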

The sample used in this study consisted of 88 student volunteers from an 
undergraduate introductory measurement course. Of the 88 students, 21 were male 
and 67 female. The group consisted of 19 juniors, 67 seniors, and 2 graduate
students. The tailored tests were administered the week following a classroom 
test over the same content. The examinees were told that the tailored test score 
would be substituted for the classroom test score if they performed better on the 
tailored test, and that they would receive extra credit points for completing the 
requirements of the study. 

Analyses 

The major analysis performed in this study was the comparison of the grade 
classifications over the test-retest period. This analysis was to show which 
procedure (1PL or 3PL) gave more consistent grade classification over the one 
week time period. Since the grade scale yields mainly categorical results, a
phi coefficient derived from the chi-square contingency table was used for this
analysis. The same analysis was also performed to determine which procedure 
made grade classifications that were more similar to those obtained from a tra- 
ditional classroom test. 
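A phi coefficient of this kind can be computed from a contingency table as sketched below. The paper does not spell out its formula, so the common definition, the square root of chi-square divided by N, is assumed here, and the example table is invented.

```python
import math

def chi_square(table):
    """Pearson chi-square for a contingency table given as rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat, n

def phi_coefficient(table):
    """Phi as sqrt(chi-square / N); an assumed definition, see lead-in."""
    stat, n = chi_square(table)
    return math.sqrt(stat / n)

# Invented 2 x 2 example: perfectly consistent classifications of 88 cases.
phi_perfect = phi_coefficient([[44, 0], [0, 44]])
```

For tables larger than 2 x 2 this quantity is not bounded by 1.0, which is one reason a different normalization is sometimes preferred.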

Along with the above analyses, the distributions of grades for the two
procedures were determined and compared. The number of items required for a
decision was also tabulated for each procedure and the mean numbers of items
required were compared using a two-way ANOVA. Session and procedure were the
independent variables in this analysis, with repeated measures over both
session and procedure.

Results

The direct result of the tailored testing procedure in this study is the
classification of students into grade categories using the SPRT paradigm. The
results of this grade classification for the 1PL and 3PL tailored testing
procedures and the traditional classroom test are shown in Table 1. This table
presents the frequency distribution of the grades for each procedure and each
testing session. The means and standard deviations are also presented to
summarize the distributions even though the data are only ordinal.

From these results, a tendency can be seen for the 1PL procedure to grade
slightly easier than the 3PL procedure. The traditional test assigned the
highest average grade of all the procedures. This can probably be explained by
the fact that the classroom test was the test studied for and it was taken first.
The standard deviations of grades for the 1PL and 3PL procedures were about the
same, with a slight increase in the second testing session. The traditional
test had the smallest standard deviation of all of the procedures.

Table 1

Grade Distributions for the 1PL and 3PL Tailored Tests
and the Traditional Classroom Test

                             Procedure
Session   Grade      1PL       3PL       Traditional

   1      A(4)       13         6            8
          B(3)       60        58           78
          C(2)       20        26           10
          D(1)        7        10            4
          x̄         2.78      2.59         2.91
          s.d.       .75       .75          .56

   2      A(4)       18        12
          B(3)       54        50
          C(2)       17        27
          D(1)       11        10
          x̄         2.78      2.65
          s.d.       .88       .83

Note: The values presented in the table are percentages of 88 cases.
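The summary statistics in Table 1 can be roughly checked from the percentage distributions. The sketch below recomputes the mean and standard deviation for the first-session 1PL grades; the function name is hypothetical, and small differences from the tabled values are to be expected because the percentages are rounded.

```python
def grade_mean_sd(percentages):
    """percentages maps grade points (4 = A ... 1 = D) to percent of cases."""
    total = sum(percentages.values())
    mean = sum(g * p for g, p in percentages.items()) / total
    variance = sum(p * (g - mean) ** 2 for g, p in percentages.items()) / total
    return mean, variance ** 0.5

# First-session 1PL distribution from Table 1 (percentages of 88 cases).
mean_1pl, sd_1pl = grade_mean_sd({4: 13, 3: 60, 2: 20, 1: 7})
```

This gives a mean of about 2.79 and a standard deviation of about .75, in line with the tabled 2.78 and .75.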



The results of the consistency of classification analysis are presented
in Table 2 along with a comparison with the grades assigned by the traditional
classroom exam over the same course content and the final grade in the course.
As can be seen from this table, the consistency of the 3PL/SPRT procedure was
substantially higher than the 1PL/SPRT procedure (phi = .938 vs. .662; t = 5.19,
p < .01).

Table 2

Phi Coefficients Showing the Consistency
of Grade Classifications and the Relationship
With Traditional Grading Practices

                                Test
                                          Course    Final
Test       1PL-2     3PL-1     3PL-2      Exam      Grade

1PL-1      .662      .340      .489       .486      .679
1PL-2                .448      .645       .495      .710
3PL-1                          .938       .376      .461
3PL-2                                     .490      .649

Note: All phi coefficients are based on 88 cases.

The relationship between the tailored testing results and the traditional
grading schemes shows a more confusing pattern. The 1PL procedure had a
correlation of around .5 with the exam grades and about .7 with the final grades.
This was unexpected because the course exam was on the same material as the
tailored test, while the final grade was based on a composite of three exams
over different content areas. The correlations of the 3PL procedure with the
course grade gave a similar pattern of results, but the grades assigned by
the first 3PL session had lower phi coefficients. The results from the second
testing were about the same magnitude as the 1PL results.

The data on the mean number of test items required to make the grade
classifications are presented in Table 3. Since the tailored testing
procedures were terminated if a grade decision were not made at or before 20 items,
the table also gives the percent of cases making classifications in 20 items
or less. As can be seen from this table, the 1PL procedure seldom was able to
make classification decisions in 20 items or less, while about half the time
the 3PL procedure could. Overall, the 3PL procedure required significantly
fewer items to make a decision than the 1PL procedure (x̄ = 13.41 vs. 18.14).
Significantly fewer items were also required for the second testing session.
The ANOVA on the number of items required for classification is given in
Table 4. The low number of items required for a grade classification is even
more dramatic when compared to the 50 items used to make the grade
classifications with the traditional test.



Table 3

Average Number of Items Required
To Make Grade Classifications
by Procedure and Session

                                               Procedure
                                           1PL               3PL
                                         Session           Session
                                        1        2        1        2

Percent using 20 items or less        'j.7l    b.bCJ    50.00    53.40
x̄ for cases using 20 items or less   1 I.JM   H.:»0     9.02    11.80
x̄ for all cases (N=88)               18.61    17.66    13.97    12.85
s.d. for all cases                     2.85     4.00     4.94     5.00

Table 4

ANOVA Results on Number of Items Administered With
Model and Session as Independent Variables and
Repeated Measures on Both Variables

Source                    SS        df       MS         F        p

Model                   1966.55      1     1966.55    96.55     .00
Session                   94.10      1       94.10     6.59     .01
Model x Session             .56      1         .56      .03     .85
Error (model)           1771.95     87       20.37
Error (session)         1242.40     87       14.28
Error (interaction)     1397.94     87       16.07
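The F ratios in Table 4 can be reproduced from the sums of squares and degrees of freedom, with each effect tested against its own error term as is usual in a repeated measures design. The function name is illustrative.

```python
def f_ratio(ss_effect, df_effect, ss_error, df_error):
    """F = (SS_effect / df_effect) / (SS_error / df_error)."""
    ms_effect = ss_effect / df_effect
    ms_error = ss_error / df_error
    return ms_effect / ms_error

# Values taken from Table 4.
f_model = f_ratio(1966.55, 1, 1771.95, 87)
f_session = f_ratio(94.10, 1, 1242.40, 87)
f_interaction = f_ratio(0.56, 1, 1397.94, 87)
```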







Discussion

The major thesis of this paper is that the number of items required to
make a decision concerning the classification of individuals above or below a
cutting score can be substantially reduced from the number traditionally used.
This can be done because abilities far removed from the cutting score need not
be measured as precisely as those who are near the cutting score. In order to 
implement a testing procedure that can modify the length of the test as a func- 
tion of the examinee's ability, a tailored testing procedure based on maximum 
information item selection and maximum likelihood ability estimation (McKinley 
and Reckase, 1980) was combined with Wald's (1947) Sequential Probability Ratio
Test. 

Common wisdom in test theory indicates that in order to accurately classify 
individuals into two groups, the items should be selected to be most informative 
at the cutting score (Lord & Novick, 1968). This could be done in this situation
by selecting items with maximum information at the cutting score and using the
usual SPRT procedure. However, in this case three cutting scores were present
(A/B, B/C, C/D) so the usual tailored testing item selection procedure of choosing
items to give maximum information at the most recent ability estimate was used.

Beyond demonstrating the economics of the tailored testing/SPRT hybrid over 
traditional testing, the purpose of this paper was to compare tailored tests 
based on the 1PL model with tailored tests based on the 3PL model. The results
showed that the 3PL procedure is clearly more consistent than the 1PL procedure,
but that the relationship to the grades based on the classroom tests was about
the same or a little worse for the 3PL procedure. This may be explained by the
fact that the 1PL model tends to give ability estimates that are the sum of the
components in a test while the 3PL based tests tend to give ability estimates
that are more pure measures of the first principal component of a test (see
Reckase, 1979, for a more thorough discussion). The larger correlations with
the final grades than with the exam grades are probably due to the higher
reliability of the final composite based on the sum of three exams. The generally
low correlations with the course grades were probably due to the low reliability
of the course exams (.60) and differences in method variance.

The test length analyses resulted in several interesting findings. First, 
the 1PL based procedure had great difficulty in classifying students into grade 
categories with less than 20 items. The three parameter procedure could make 
the classification with less than 20 items about half the time. On the average, 
the 3PL procedure required about 5 items less for classification than the 1PL
procedure. This shorter test length with higher consistency of classification
is probably a result of the advantage obtained by using the item discrimination
parameter in item selection. Since the 1PL procedure assumes that all items are
of equal discriminating power, only the nearness of the item difficulty parameter 
to the most recent ability estimate affects item selection. In selecting items 
using maximum information with the 3PL procedure, discrimination, guessing, and 
difficulty parameters contribute to selection. This results in the administra- 
tion of higher quality items overall. The fewer test items required in the 
second session may be due to greater familiarity with the testing system result- 
ing in fewer mistakes in using the terminals. McKinley & Reckase (1980) give 
more details concerning the characteristics of the items actually administered 
in this study. 

Summary and Conclusions 


The purpose of this paper has been to compare two tailored testing based 
decision making procedures using the Sequential Probability Ratio Test. The 
procedures were based on the one-parameter logistic model and the three-para- 
meter logistic model. The procedures were also compared to traditional paper 
and pencil test based grades. 

The results of the study showed that the 3PL based tailored test/SPRT pro- 
cedure had higher decision consistency and required fewer test items than the 
1PL based procedure. The tailored testing/SPRT procedure also required
substantially fewer items than the traditional classroom test (x̄ = 13.4 vs. 50).
These results indicate that a substantial increase in efficiency can be obtained 
through the use of tailored testing/SPRT procedures, but that the grades assigned 
may not be the same as those given using a traditional method. Of the two pro- 
cedures used in this study, the 3PL based method was superior to the 1PL method 
in decision consistency and number of items required. Both procedures had about
the same correlations with the traditional grades.